Abstract

Population invariance is an important requirement of test equating. An equating function is said 
to be population invariant when the choice of (sub)population used to compute the equating 
function does not matter. In recent studies, the extent to which equating functions are population 
invariant is typically addressed in terms of practical significance and not in terms of the equating 
functions’ sampling variability. 

This paper shows how to extend the framework of kernel equating to evaluate population 
invariance in terms of statistical significance. Derivations based on the kernel method’s standard 
error formulas are given for computing the standard errors of the root mean square difference 
(RMSD) and of the simple difference between two subpopulations’ equated scores. An 
investigation of population invariance for the equivalent groups design is discussed. The 
accuracy of the derived standard errors is evaluated with respect to empirical standard errors. 
This evaluation shows that the accuracy of the standard error estimates for the equated score 
differences is better than for the RMSD and that accuracy for both standard error estimates is 
best when sample sizes are large. 

Key words: Population invariance, equating, standard errors 
Introduction

Population invariance is an important requirement of test equating, the process of
adjusting the scores of test forms so that they are comparable (Angoff, 1971; Dorans & Holland,
2000; Kolen & Brennan, 1995; Lord, 1980; Peterson, Kolen, & Hoover, 1989). An equating 
function is said to be population invariant when the choice of (sub)population used to compute 
the equating function does not matter. The extent to which equating functions are population 
invariant is commonly addressed in terms of differences that matter (Dorans, Holland, Thayer, & 
Tateneni, 2004; Yang, 2004; Yin, Brennan, & Kolen, 2004), that is, the differences in equated 
scores that are greater than would be corrected by the score rounding that occurs before score 
reporting (Dorans & Feigenbaum, 1994). The differences-that-matter criterion indicates the 
practical, rather than the statistical, significance of equating function differences. 

The focus of this paper is on incorporating equating functions’ sampling variability into 
population invariance measures in order to evaluate population invariance with respect to 
statistical significance. The approach taken in this paper is based on the delta method, which is 
different from other approaches, including those that have conducted significance tests of scale 
score distribution differences (Segall, 1997), drawn random samples to estimate standard errors 
of equating functions (Angoff & Cowell, 1986), and utilized bootstrap resampling for item 
parameter estimates (Williams, Rosa, McLeod, Thissen, & Sanford, 1998). In this paper, 
derivations are given for computing theoretical standard errors of the root mean square 
difference (RMSD; Dorans & Holland, 2000) and the standard error of the simple difference 
between two independent subpopulations’ equated scores. These derivations are intended to be 
applied using the kernel method of observed score equating (von Davier, Holland, & Thayer, 
2004; Holland & Thayer, 1989). The kernel method is an equating method with a framework that 
is general enough to consider population invariance at each explicit step of the equating process, 
across linear and curvilinear equating functions, and across all of the major equating designs. 

The outline of this paper is as follows. First, the steps of kernel equating are reviewed and 
extended for the computation of subpopulations’ and populations’ equating functions, population 
invariance measures, and their standard errors. Next, an example of an investigation of 
population invariance for the equivalent groups design is given. This example is used to 
demonstrate and evaluate the proposed standard errors. Finally, the implications of this 
investigation and applications of the derivations used in this paper to evaluating population
invariance for other equating designs are discussed. 


Equivalent Groups Kernel Equating and Extensions for Population Invariance Measures

Kernel equating is a unified approach to test equating based on a flexible family of
curvilinear and linear equating functions (von Davier et al., 2004; Holland & Thayer, 1989).
While the kernel method can be used to compute equating functions based on any of the major 
data collection designs, the focus of this paper is on linking functions computed from the 
equivalent groups data collection design (reviewed next). In this section, the five steps of the 
kernel method are summarized for the equivalent groups design. Extensions of these steps are 
given for the consideration of population invariance in linking functions that provide (a) a 
significance test of the difference between two subpopulations’ linking functions and (b) the
standard error for the RMSD measure. 

The Equivalent Groups Design 

The equivalent groups design is a data collection design where two independent random 
samples are drawn from a common population of examinees, P, and one sample is administered 
test X and the other sample is administered test Y. Table 1 shows the population, samples, and 
tests administered in the equivalent groups design. 

The assumptions of the X-to-Y equating are the following:

1. There is a single population P of examinees who could take either test. 

2. The two samples are independently and randomly drawn from the common 
population of examinees, P. 


Table 1

The Equivalent Groups Data Collection Design

Population    Sample    Test X    Test Y
P             1         √
P             2                   √


When the equivalent groups design is extended to a consideration of subpopulations of
the samples who take tests X and Y, the assumptions of the design are also extended. Table 2
shows the equivalent groups design when there are G subpopulations making up population P.
The X-to-Y equatings for $P_1, \ldots, P_G$ in the equivalent groups design add a third
assumption:

3. The G subpopulations are mutually exclusive and independent. 


Table 2

The Equivalent Groups Data Collection Design for G Subpopulations

Population    Sample and subpopulation    Test X    Test Y
P             P1                          √
              P2                          √
              ...                         √
              PG                          √
P             P1                                    √
              P2                                    √
              ...                                   √
              PG                                    √


The assessment of the population invariance equating assumption using population
invariance measures is an empirical evaluation based on comparing the X-to-Y equated scores for
$P_1, \ldots, P_G$ and P. The five steps of the kernel method and their extensions to consider
population invariance are now summarized for the equivalent groups design.

Step 1: Presmoothing, Kernel Equating 

Estimates of the univariate score distributions and the C-matrices, the factorization
matrices of the covariance matrix of the estimated distributions, are obtained by fitting loglinear
smoothing models (Holland & Thayer, 1989, 2000) to the raw data obtained by the data
collection design. For the equivalent groups design, two loglinear smoothing models are used to
preserve a number of moments, $T_X$ and $T_Y$, from the observed distributions of each separate test,
X and Y, respectively. These loglinear smoothing models produce estimated univariate
distributions, $R$ and $S$, and their corresponding root covariance C-matrices, $C_R$ and $C_S$, for X
and Y in the total population P. For an equivalent groups design and an X with J possible score
values and a Y with K possible score values, the dimensions of $R$ and $S$ are J-by-1 and K-by-1
and the dimensions of $C_R$ and $C_S$ are J-by-$T_X$ and K-by-$T_Y$.

Population Invariance Measures Extension of Step 1 

When kernel equating is extended to a population invariance study, two loglinear
smoothing models for each subpopulation’s X and Y score distributions are needed to obtain
$R_{P_g}$ and $S_{P_g}$, and $C_{RP_g}$ and $C_{SP_g}$, for all G subpopulations. Since the total
population’s X and Y distributions are functions of the subpopulations’ distributions under the
third assumption, smoothing models for the total population are not necessary. The justification
of the separate smoothing models and outputs follows from the assumption of independent
subpopulations, which consequently do not share the parameters of their respective smoothing
models with those of any other subpopulation.

Step 2: Estimation of the Score Probabilities, Kernel Equating

In Step 2, column vectors of estimated score probabilities ($r$ and $s$) are obtained from
the estimated score distributions ($R$ and $S$) through a design function. For the equivalent
groups design, the estimated score probabilities are equal to the estimated score distributions
($r = R$, $s = S$, $C_r = C_R$, and $C_s = C_S$). Therefore, the design function is the identity function
and hence, the matrix of derivatives of the design function with respect to the score probabilities,
$J_{DF}$, is an identity matrix for X’s J total score probabilities ($I_J$) and another identity matrix for
Y’s K total score probabilities ($I_K$). For other equating designs, the design function and its
derivative matrix ($J_{DF}$) involve additional computations because $r$ and $s$ are not equal to
$R$ and $S$.

Population Invariance Measures Extension of Step 2 

For the extension to the population invariance measures, the estimated score probabilities 
from the smoothed distributions must be produced for each of the G subpopulations and, for the 
RMSD measure, also for the overall population. For each subpopulation, the score probabilities 
are estimated from the smoothed distributions in exactly the same way as they are for kernel
equating, which is simply through the design function. Once these subpopulation score
probabilities are produced, the overall population score probabilities can be estimated as explicit
functions of the subpopulation samples and score probabilities (Assumption 3):

\[ r_{jP} = \frac{\sum_{g=1}^{G} n_{XP_g} \, r_{jP_g}}{\sum_{g=1}^{G} n_{XP_g}} \tag{1} \]

\[ s_{kP} = \frac{\sum_{g=1}^{G} n_{YP_g} \, s_{kP_g}}{\sum_{g=1}^{G} n_{YP_g}} \tag{2} \]

where $r_{jP_g} = \mathrm{Prob}\{X = x_j \mid P_g\}$, $s_{kP_g} = \mathrm{Prob}\{Y = y_k \mid P_g\}$,
$n_{XP_g}$ is the total number of examinees taking test X in subpopulation $P_g$, and $n_{YP_g}$ is the total
number of examinees taking test Y in subpopulation $P_g$.
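As an illustration of Equations 1 and 2, the pooling of subpopulation score probabilities can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `pooled_score_probabilities` and the three-point score distributions are hypothetical, and only the sample sizes (973 and 296) are taken from the example later in the paper.

```python
def pooled_score_probabilities(sub_probs, sub_sizes):
    """Sample-size-weighted average of subpopulation score probabilities.

    sub_probs: list of G probability vectors (one per subpopulation).
    sub_sizes: list of G sample sizes for the same test form.
    """
    if len(sub_probs) != len(sub_sizes):
        raise ValueError("need one sample size per subpopulation")
    total = sum(sub_sizes)
    n_scores = len(sub_probs[0])
    return [
        sum(n * p[j] for n, p in zip(sub_sizes, sub_probs)) / total
        for j in range(n_scores)
    ]


# Hypothetical score probabilities on a three-point scale (illustration only).
r_p1 = [0.2, 0.5, 0.3]
r_p2 = [0.4, 0.4, 0.2]

# Pool with the X sample sizes of the two subpopulations.
r_p = pooled_score_probabilities([r_p1, r_p2], [973, 296])
print(r_p)
```

The same function applies to the Y form by passing the subpopulations' Y probabilities and Y sample sizes.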

Step 3: Continuation, Kernel Equating 

In Step 3, continuous approximations, $F_{h_X}(x)$ and $G_{h_Y}(y)$, to the estimated discrete
cumulative distribution functions (cdfs), $F(x)$ and $G(y)$, are determined using Gaussian kernel
smoothing (Ramsay, 1991). This step involves a choice of the bandwidth parameters, $h_X$ and $h_Y$.
von Davier et al. (2004) suggested two criteria for selecting the bandwidth parameters: (a) the
bandwidth parameters should produce probability density functions that closely match the
smoothed discrete probabilities, and (b) the bandwidth parameters should produce probability
density functions that do not have many modes.
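The continuization step can be sketched as follows. The formula used here is the Gaussian-kernel form commonly given in the kernel equating literature (von Davier et al., 2004) and is not written out in this paper, so it should be read as an assumption; the function names, the five-point score scale, and the bandwidth of 0.6 are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuization of a discrete score distribution.

    Assumed form: F_h(x) = sum_j r_j * Phi((x - a*x_j - (1 - a)*mu) / (a*h)),
    with a = sqrt(var / (var + h**2)), so that the continuous distribution
    keeps the mean and variance of the discrete one.
    """
    mu = sum(p * s for s, p in zip(scores, probs))
    var = sum(p * (s - mu) ** 2 for s, p in zip(scores, probs))
    a = math.sqrt(var / (var + h * h))
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


# Hypothetical symmetric distribution on scores 0..4 (illustration only).
scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, round(kernel_cdf(x, scores, probs, h=0.6), 4))
```

Smaller bandwidths track the discrete probabilities more closely, while larger bandwidths give smoother densities with fewer modes, which is the trade-off behind the two selection criteria.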
Population Invariance Measures Extension of Step 3

The continuous cdf approximations $F_{h_X}(x)$ and $G_{h_Y}(y)$ are determined for each
subpopulation’s cumulative distribution (cumulated Equations 1 and 2) by selecting
subpopulation-specific bandwidth parameters $h_{XP_g}$ and $h_{YP_g}$ for $F_{h_{XP_g}}(x)$ and
$G_{h_{YP_g}}(y)$, and also by selecting the population-specific bandwidth parameters $h_{XP}$ and
$h_{YP}$ for $F_{h_{XP}}(x)$ and $G_{h_{YP}}(y)$. Each cdf
approximation is based on the corresponding subpopulation or population score probabilities
estimated in Step 2.

Step 4: Equating, Kernel Equating 

The estimated equating function is formed from the continuous cdfs, $F_{h_X}(x)$ and $G_{h_Y}(y)$,
using the following formula: $e_Y(x) = G_{h_Y}^{-1}(F_{h_X}(x))$.
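The composition $G^{-1}(F(x))$ can be sketched numerically. In this minimal sketch the inverse is found by bisection rather than in closed form, and the continuization formula is the one commonly given in the kernel equating literature, which is an assumption here. All names, score scales, and bandwidths are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized cdf (assumed standard kernel equating form)."""
    mu = sum(p * s for s, p in zip(scores, probs))
    var = sum(p * (s - mu) ** 2 for s, p in zip(scores, probs))
    a = math.sqrt(var / (var + h * h))
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


def kernel_equate(x, x_scores, r, h_x, y_scores, s, h_y, tol=1e-10):
    """Return e_Y(x) = G^{-1}(F(x)), inverting G by bisection."""
    target = kernel_cdf(x, x_scores, r, h_x)
    lo, hi = min(y_scores) - 1.0, max(y_scores) + 1.0
    # Widen the bracket until G(lo) <= target <= G(hi).
    while kernel_cdf(lo, y_scores, s, h_y) > target:
        lo -= 1.0
    while kernel_cdf(hi, y_scores, s, h_y) < target:
        hi += 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, y_scores, s, h_y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical forms: Y has the same shape as X but is one point easier.
x_scores = [0, 1, 2, 3, 4]
y_scores = [1, 2, 3, 4, 5]
r = [0.1, 0.2, 0.4, 0.2, 0.1]
s = [0.1, 0.2, 0.4, 0.2, 0.1]
print(kernel_equate(1.5, x_scores, r, 0.6, y_scores, s, 0.6))
```

Because the hypothetical Y distribution is the X distribution shifted by one point, the equated score should sit one point above the X score, which gives a simple check on the inversion.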


Population Invariance Measures Extension of Step 4 

Equating functions are computed for each subpopulation as
$e_{YP_g}(x) = G_{h_{YP_g}}^{-1}(F_{h_{XP_g}}(x))$.

The equating function is also computed for the total population as
$e_{YP}(x) = G_{h_{YP}}^{-1}(F_{h_{XP}}(x))$.

The RMSD can now be computed to measure the extent to which subpopulations’ X-to-Y
equated scores differ from the total population’s equated scores at specific values of X ($x_j$, j = 0
to J):

\[ RMSD(x_j) = \frac{\sqrt{\sum_{g=1}^{G} w_g \left[ e_{YP_g}(x_j) - e_{YP}(x_j) \right]^2}}{\sigma_{YP}} \tag{3} \]

For the RMSD, g defines one of the G total subpopulations ($P_g$) of the total population P
($\sum_{g=1}^{G} P_g = P$), $w_g$ is the relative proportion of subpopulation $P_g$ in the total population
($w_g = \frac{n_{XP_g} + n_{YP_g}}{\sum_g n_{XP_g} + \sum_g n_{YP_g}}$), $e_{YP_g}(x_j)$ is the linking
function for X to Y for a particular score on X ($x_j$) in subpopulation $P_g$, $e_{YP}(x_j)$ is the
linking function for X to Y for score $x_j$ in the total population
P, and $\sigma_{YP}$ is the standard deviation of the Y scores in the total population P.



Differences in two subpopulations’ equated scores, $e_{YP_1}(x) - e_{YP_2}(x)$, may also be of
interest, particularly if differences in subpopulations’ equated scores are regarded as
more serious than the differences of subpopulations’ equated scores to the total population’s
equated scores measured by the RMSD.
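A small sketch may help fix the two invariance measures at a single score point. It assumes the weights are each subpopulation's share of all examinees across both forms; the function names and the equated scores are hypothetical, while the sample sizes and the Y standard deviation of 6.66 are the values reported in the example later in the paper.

```python
import math


def subpop_weights(n_x, n_y):
    """Relative size of each subpopulation, pooling X and Y samples."""
    total = sum(n_x) + sum(n_y)
    return [(nx + ny) / total for nx, ny in zip(n_x, n_y)]


def rmsd(sub_equated, total_equated, weights, sd_y_total):
    """Weighted root mean square difference at one X score,
    standardized by the total-population SD of Y."""
    ss = sum(w * (e - total_equated) ** 2 for w, e in zip(weights, sub_equated))
    return math.sqrt(ss) / sd_y_total


# Sample sizes from the paper's example; equated scores are made up.
weights = subpop_weights(n_x=[973, 296], n_y=[958, 294])
e_p1, e_p2, e_p = 10.2, 9.6, 10.06  # hypothetical equated scores at some x_j

print("RMSD:", rmsd([e_p1, e_p2], e_p, weights, sd_y_total=6.66))
print("Simple difference:", e_p1 - e_p2)
```

The RMSD is in standard deviation units of Y, whereas the simple difference stays on the raw score scale, which is why the two measures are compared with different differences-that-matter thresholds.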

Step 5: Calculating the Standard Error of Equating, Kernel Equating

The delta method is used to compute a large-sample approximation of the standard error
of equating (Bishop, Feinberg, & Holland, 1975; Kendall & Stuart, 1977). The delta method can
be summarized by saying that if a vector of parameter estimates ($\hat{\theta}_n$) is distributed
approximately as $N(\theta, \Sigma(\theta))$ when the variance-covariance matrix $\Sigma(\theta)$ is small, then a function of
these parameter estimates, $R(\hat{\theta}_n)$, has an approximate
$N\left(R(\theta), \frac{\partial R}{\partial \theta} \Sigma(\theta) \left(\frac{\partial R}{\partial \theta}\right)^t\right)$ distribution (von
Davier et al., 2004, p. 198). For kernel equating, the loglinear smoothing output corresponds to
$\hat{\theta}_n$, the C-matrix is the factorization of $\Sigma(\theta)$, and the design and equating functions are $R(\hat{\theta}_n)$.

Therefore, the standard error of equating reflects the smoothed distributions, the conversion of
the smoothed distributions into score probabilities, and the bandwidth-dependent equating
functions. For the equivalent groups design:


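\[ SEE_Y(x) = \left\| \left( \frac{\partial e_Y}{\partial r}, \frac{\partial e_Y}{\partial s} \right)
\begin{pmatrix} \frac{\partial r}{\partial R} C_R & 0 \\ 0 & \frac{\partial s}{\partial S} C_S \end{pmatrix} \right\| \tag{4} \]

where $\left( \frac{\partial e_Y}{\partial r}, \frac{\partial e_Y}{\partial s} \right)$ are the partial derivatives of the equating
function with respect to the score probabilities of X and Y, and
$\begin{pmatrix} C_r & 0 \\ 0 & C_s \end{pmatrix}$ is made up of the two C-matrices, $C_r$ and $C_s$, computed
in Step 1. Because the design function is the identity in the equivalent groups design,
$\frac{\partial r}{\partial R} C_R = C_r$ and $\frac{\partial s}{\partial S} C_S = C_s$. In (4),
$\|v\| = \sqrt{\sum_i v_i^2}$ denotes the Euclidean length (norm) of vector $v$.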
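The structure of Equation 4, a gradient row vector multiplied by a C-matrix and reduced to its Euclidean length, can be sketched with a deliberately simple stand-in. In this sketch the statistic is the mean score rather than an equating function, and the C-matrix factors the covariance of unsmoothed multinomial proportions rather than a loglinear model's covariance. Both are assumptions chosen so the result can be checked against the familiar formula sqrt(variance / n). The function names are hypothetical.

```python
import math


def delta_method_se(gradient, c_matrix):
    """Euclidean norm of (gradient row vector) x (C-matrix),
    where C C^t is the covariance matrix of the estimates."""
    n_cols = len(c_matrix[0])
    row = [
        sum(g * c_row[t] for g, c_row in zip(gradient, c_matrix))
        for t in range(n_cols)
    ]
    return math.sqrt(sum(v * v for v in row))


def multinomial_c_matrix(probs, n):
    """A factor C with C C^t = (diag(p) - p p^t) / n, the covariance of
    sample proportions from n multinomial draws (stand-in for a
    loglinear-model C-matrix)."""
    k = len(probs)
    root = [math.sqrt(p) for p in probs]
    return [
        [((root[j] if j == t else 0.0) - probs[j] * root[t]) / math.sqrt(n)
         for t in range(k)]
        for j in range(k)
    ]


# Toy statistic: the mean score, whose gradient w.r.t. the score
# probabilities is simply the vector of score values.
scores = [0.0, 1.0, 2.0]
probs = [0.2, 0.5, 0.3]
n = 100

se = delta_method_se(gradient=scores, c_matrix=multinomial_c_matrix(probs, n))
print(se)
```

For an actual standard error of equating, the gradient would hold the derivatives of the equating function with respect to the score probabilities of X and Y, and the C-matrices would come from the fitted loglinear models.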
Population Invariance Measures Extension of Step 5

Significance test for two subgroups’ equating functions. When loglinear smoothing
models and equating functions are computed for independent subpopulations, the differences 
between their equated scores can be evaluated with respect to a standard error of equating 
difference (SEED). The SEED used in this paper differs from what was proposed in von Davier 
et al. (2004) because the equated score differences evaluated in this paper are of independent 
subpopulations (where the C-matrices are not shared) rather than of linear and curvilinear 
equating functions (where the C-matrices are common). The SEED used in this paper is the 
square root of the sum of each subpopulation’s squared standard errors of equating (SEE; see 
Appendix A). 


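\[ SEED_Y(x) = \sqrt{
\left\| \left( \frac{\partial e_{YP_1}}{\partial r_{P_1}}, \frac{\partial e_{YP_1}}{\partial s_{P_1}} \right)
\begin{pmatrix} C_{rP_1} & 0 \\ 0 & C_{sP_1} \end{pmatrix} \right\|^2
+ \left\| \left( \frac{\partial e_{YP_2}}{\partial r_{P_2}}, \frac{\partial e_{YP_2}}{\partial s_{P_2}} \right)
\begin{pmatrix} C_{rP_2} & 0 \\ 0 & C_{sP_2} \end{pmatrix} \right\|^2 }
= \sqrt{Var(e_{YP_1}) + Var(e_{YP_2})} \tag{5} \]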


Thus, the SEED is not unique to the kernel method, but can be, and has been (Harris & 
Kolen, 1986, p. 40), computed based on the standard errors of any equating function by noting 
that the standard error of the difference of the equating functions of two independent 
subpopulations is the square root of the sum of each equating function’s variance evaluated at 
each score of X. 
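Because the SEED only needs each subpopulation's standard error of equating at a given score, it is easy to sketch. The function names and the numerical values below are hypothetical, and the multiplier of 2 mirrors the +/- 2 standard error bands used in the example later in the paper.

```python
import math


def seed(see_p1, see_p2):
    """Standard error of the difference between two independent
    subpopulations' equated scores at one X score."""
    return math.hypot(see_p1, see_p2)  # sqrt(see_p1**2 + see_p2**2)


def is_significant(diff, seed_value, multiplier=2.0):
    """Flag a difference that falls outside +/- multiplier * SEED."""
    return abs(diff) > multiplier * seed_value


# Hypothetical standard errors and equated score difference at one score.
seed_value = seed(0.3, 0.4)
print(seed_value)
print(is_significant(1.2, seed_value))
print(is_significant(0.9, seed_value))
```

A difference can be statistically significant without exceeding the differences-that-matter threshold, and the reverse, so the two criteria are best read together.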

The standard error of the RMSD, RMSDSE. The RMSD measure gives a value based on
a particular X score that is a function of the estimated probabilities in the subpopulations and
population ($RMSD(x; r_{P_1}, r_{P_2}, \ldots, r_{P_G}; s_{P_1}, s_{P_2}, \ldots, s_{P_G})$). The formula for the
standard error for the RMSD (RMSDSE) can be written in a form that is general enough to apply to all
of the major equating designs. This formula includes the derivatives of the RMSD and subpopulation
equating functions with respect to all $P_g$ X and Y score probabilities, derivatives of all $P_g$ X and Y
score probabilities with respect to the estimated (smoothed) distributions, and the factorized
variance-covariance matrices of all $P_g$ estimated distributions:

\[ RMSDSE(x) = \left\| \left( \frac{\partial RMSD}{\partial r_{P_1}}, \ldots, \frac{\partial RMSD}{\partial r_{P_G}},
\frac{\partial RMSD}{\partial s_{P_1}}, \ldots, \frac{\partial RMSD}{\partial s_{P_G}} \right) J_{DF} C \right\| \]

\[ = \left\| \left( \frac{\partial RMSD}{\partial r_{P_1}}, \ldots, \frac{\partial RMSD}{\partial r_{P_G}},
\frac{\partial RMSD}{\partial s_{P_1}}, \ldots, \frac{\partial RMSD}{\partial s_{P_G}} \right)
\begin{pmatrix}
C_{rP_1} & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & C_{rP_G} & 0 & \cdots & 0 \\
0 & \cdots & 0 & C_{sP_1} & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & C_{sP_G}
\end{pmatrix} \right\| \tag{6} \]


The derivatives of RMSD with respect to the $P_g$ X and Y score probabilities are given in
Appendix B. An accompanying paper will capitalize on the generality provided by von Davier et
al.’s (2004) development of $J_{DF} C_{P_g}$, and show how the RMSD standard errors can be computed in
population invariance studies using data collection designs other than the equivalent groups design.
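Since the matrix in Equation 6 is block diagonal for the equivalent groups design, the norm can be accumulated one block at a time without ever building the full matrix. The sketch below illustrates only that bookkeeping: the function name is hypothetical, and the gradient blocks and C-matrices are made-up numbers, not the Appendix B derivatives.

```python
import math


def block_delta_se(gradient_blocks, c_blocks):
    """Norm of (g_1, ..., g_B) times blockdiag(C_1, ..., C_B).

    With a block-diagonal C, the squared norm is the sum over blocks of
    the squared norms of g_b C_b.
    """
    total = 0.0
    for grad, c in zip(gradient_blocks, c_blocks):
        n_cols = len(c[0])
        for t in range(n_cols):
            component = sum(g * c_row[t] for g, c_row in zip(grad, c))
            total += component * component
    return math.sqrt(total)


# Made-up two-block example (e.g., one X block and one Y block).
gradient_blocks = [[1.0, 0.0], [0.0, 2.0]]
c_blocks = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 2.0]],
]
print(block_delta_se(gradient_blocks, c_blocks))
```

In an application there would be 2G blocks, one for each subpopulation's X distribution and one for each subpopulation's Y distribution.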

Example 

In this section, the previously described population invariance measures and standard 
errors are demonstrated and evaluated using actual test data. The data were obtained in a special 
study where two 42-item exam forms were given to high school students in a spiraled 
administration. The content of the exam was English literature. These data were used to assess 
the extent of population invariance in the equating function for the two exam forms. The 
subpopulations of interest were examinees from schools that were not in large cities (P1) and
examinees from schools that were in large cities (P2). Table 3 presents the summary statistics for
the population and subpopulations based on number-correct scores. The statistics in Table 3
show that the large sample of P1 examinees did better on the X and Y forms than did the smaller
sample of P2 examinees. In addition, the P1 examinees did slightly better on the X form than on
the Y form while the P2 examinees did slightly better on the Y form than on the X form.
Table 3

Descriptive Statistics of the Subpopulations and Population

Subpopulations and population    Test    N        Mean     Std. dev.    Skew    Kurtosis
P1                               X       973      24.89    6.65          .07    -.47
P1                               Y       958      24.85    6.45          .02    -.21
P2                               X       296      23.87    7.39         -.15    -.45
P2                               Y       294      23.92    7.25         -.07    -.54
P (= P1 + P2)                    X       1,269    24.65    6.84          .01    -.35
P (= P1 + P2)                    Y       1,252    24.63    6.66         -.03    -.17


Kernel Equating, Steps 1-4 

Loglinear smoothing models were fit to the four subpopulation score distributions. The
model selection process was to consider the relative fits of five models that preserved 2 through
6 moments using four likelihood ratio chi-square tests with alpha levels of 1 - (1 - .05)^(1/4) = .0127
(Haberman, 1974). The selected models for X and Y in P1 preserved 6 and 5 moments and
had likelihood ratio statistics of 17.11 (df = 36, p > .05) and 27.58 (df = 37, p > .05),
respectively. The selected models for X and Y in P2 both preserved 2 moments and had likelihood
ratio statistics of 26.16 (df = 40, p > .05) and 49.84 (df = 40, p > .05), respectively. The score
probabilities were obtained directly from the smoothed frequencies.
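The arithmetic in this paragraph can be checked with a short sketch. The per-test alpha follows the expression given above. The degrees-of-freedom rule (number of score points minus one, minus the number of preserved moments) is an assumption that happens to reproduce the reported values for a 42-item form with 43 possible scores. Both function names are hypothetical.

```python
def per_test_alpha(familywise_alpha, n_tests):
    """Alpha level per test so that n_tests independent tests have the
    stated overall alpha: 1 - (1 - alpha)^(1 / n_tests)."""
    return 1.0 - (1.0 - familywise_alpha) ** (1.0 / n_tests)


def loglinear_df(n_score_points, n_moments):
    """Assumed residual degrees of freedom for a univariate loglinear
    model that preserves n_moments moments."""
    return n_score_points - 1 - n_moments


print(round(per_test_alpha(0.05, 4), 4))

# A 42-item form has scores 0..42, that is, 43 score points.
for moments in (6, 5, 2):
    print(moments, loglinear_df(43, moments))
```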

Continuous cdfs were estimated based on the discrete and smoothed score distributions of 
the four subpopulation score distributions and two population (P1 + P2) score distributions. A
parabolic interpolation procedure (Press, Teukolsky, Vetterling, & Flannery, 1992) was used to 
select Gaussian kernel bandwidths that minimized the extent to which the continuous 
distributions deviated from the loglinear smoothed distributions, while having very few modes. 

Finally, the X-to-Y kernel equating functions for P1, P2, and P were computed, along
with the RMSD. Figure 1 plots the equated score differences (P1 - P2), along with practical
differences-that-matter lines of +/- .5 score points. Figure 2 plots the RMSD, along with a
practical difference-that-matters line of .5/$\sigma_{YP}$. The figures indicate that population dependence
in the X-to-Y equating function is serious enough to create practically important differences
between the subpopulations’ equating functions (Figure 1) and between each of the
subpopulations’ and the population equating functions (Figure 2) for scores X = 0 through 9.


Step 5: Standard Errors, Delta Estimates

The population invariance measures were evaluated with respect to statistical
significance. Figures 3 and 4 plot the equated score differences and the RMSD values, along
with the differences-that-matter lines and +/- 2 times the delta method standard errors. Figures 3
and 4 indicate that the equated score differences and RMSD values are statistically significant for
scores X = 0 through 6. Both figures indicate that the population invariance measures become
extremely variable around score X = 7, a score point where the data are unusually sparse relative
to the frequencies suggested by the loglinear smoothing models. The estimated variability of the
RMSD values is directly determined by the magnitude of the equated score differences (see
Appendix B), which accounts for the abrupt decreases in its standard error at scores X = 19 and
22, the X scores with the smallest equated score differences.

Evaluating the Accuracy of the Delta Standard Errors 

The delta method standard error estimates were evaluated with respect to empirical
variability. Two hundred datasets for three sample size conditions were simulated from each of
the loglinear smoothing models selected for the four univariate distributions. For one sample size
condition, the four sample sizes of the original data were used ($N_{XP_1}$ = 973, $N_{YP_1}$ = 958,
$N_{XP_2}$ = 296, $N_{YP_2}$ = 294). For the other two sample size conditions, the four distributions were
generated with equal sample sizes of 300 and 1,000. The kernel equating functions were then computed
for each of these datasets using the same smoothing models and bandwidth parameters that were
selected with the actual data. Averages of the 200 delta method standard errors were then
computed and evaluated with respect to the standard deviations of the population invariance
measures across the 200 datasets. This evaluation provided an estimate of empirical variability
while adhering to the assumptions of the delta method (i.e., the assumptions that the loglinear
models are the true models and that the same smoothing models and bandwidth parameters are used
across all replications).
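The logic of this evaluation, comparing an analytic standard error with the standard deviation of a statistic over simulated datasets, can be sketched with a toy stand-in. In this sketch the statistic is the mean score rather than an equated score difference or RMSD, and the generating distribution is a made-up three-point distribution rather than a fitted loglinear model. Only the number of replications (200) and one of the sample sizes (300) are borrowed from the study. The function name is hypothetical.

```python
import math
import random
import statistics


def empirical_sd_of_mean(scores, probs, n, n_reps, seed):
    """Simulate n_reps datasets of size n from a known discrete
    distribution and return the SD of the sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        sample = rng.choices(scores, weights=probs, k=n)
        means.append(sum(sample) / n)
    return statistics.stdev(means)


# Made-up generating distribution (plays the role of the "true" model).
scores = [0, 1, 2]
probs = [0.2, 0.5, 0.3]
n, n_reps = 300, 200

mu = sum(p * s for s, p in zip(scores, probs))
var = sum(p * (s - mu) ** 2 for s, p in zip(scores, probs))
analytic_se = math.sqrt(var / n)          # large-sample (delta method) SE
empirical_sd = empirical_sd_of_mean(scores, probs, n, n_reps, seed=2004)

print("analytic SE:", analytic_se)
print("empirical SD:", empirical_sd)
print("ratio:", empirical_sd / analytic_se)
```

A ratio near 1 indicates that the analytic standard error tracks the simulated variability; with only 200 replications the empirical standard deviation itself carries sampling error of a few percent.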
Figure 1. X-to-Y equated score differences. [Plot of $e_{YP_1}(x) - e_{YP_2}(x)$ by score, with +/- DTM lines; $N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figure 2. RMSD(x). [Plot of RMSD(x) by score, with the DTM line; $N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figure 3. X-to-Y equated score differences. [Plot of $e_{YP_1}(x) - e_{YP_2}(x)$ by score, with +/- DTM lines and +/- 2 SEED (delta method) lines; $N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figure 4. RMSD(x). [Plot of RMSD(x) by score; $N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figures 5, 6, and 7 plot the equated score differences and +/- two times the average of the
delta method standard error estimates and +/- two times the empirical standard deviations. These
three figures show fairly close agreement between the delta method standard error estimates and
the empirical standard deviations. The delta method estimates are closest to the empirical
standard deviations for the sample size condition of N = 1,000 for all four distributions
(Figure 7). In Figure 5 ($N_{XP_1}$ = 973, $N_{YP_1}$ = 958, $N_{XP_2}$ = 296, $N_{YP_2}$ = 294) and Figure 7 (N =
1,000), the equated score differences at X = 0 through 6 are statistically significant based on the
average delta method standard errors and also on the empirical standard deviations. In Figure 6
(N = 300), none of the equated score differences are statistically significant based on either the
average delta method standard errors or the empirical standard deviations.


Figure 5. SEED evaluation based on 200 simulated datasets. [Plot of $e_{YP_1}(x) - e_{YP_2}(x)$ by score; $N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figure 6. SEED evaluation based on 200 simulated datasets. [Plot of $e_{YP_1}(x) - e_{YP_2}(x)$ by score, with +/- DTM lines, +/- 2 average SEED (delta method) lines, and +/- 2 empirical SD lines; $N_{XP_1}$ = 300, $N_{XP_2}$ = 300, $N_{YP_1}$ = 300, $N_{YP_2}$ = 300.]

Figure 7. SEED evaluation based on 200 simulated datasets. [Plot of $e_{YP_1}(x) - e_{YP_2}(x)$ by score.]

Figures 8, 9, and 10 evaluate the variability estimates of the RMSD. The RMSD depends
on the differences in populations’ sample sizes (Equation 3), so the reader should note that the RMSD
values based on the sample sizes in the observed data (Figures 2, 4, and 8) differ from the RMSD
values based on equal sample sizes in all four distributions (Figures 9 and 10). Figures 8, 9, and
10 show that the average delta method standard errors consistently overestimate the RMSD’s
empirical variability. The differences between the delta method’s standard errors and the
empirical standard deviations are greatest when the four univariate distributions are based on
sample sizes of 300 (Figure 9). For Figure 9, none of the RMSD values are statistically
significant based on the average delta method standard errors, while the RMSD values at scores
X = 3 through 5 are statistically significant (but barely) based on the empirical standard
deviations. For Figure 8 ($N_{XP_1}$ = 973, $N_{YP_1}$ = 958, $N_{XP_2}$ = 296, $N_{YP_2}$ = 294) and Figure 10 (N =
1,000), the delta method standard errors are sufficiently close to the empirical standard
deviations so that they agree on which RMSD values are statistically significant (X = 0 through
6) and insignificant (X = 7 through 42).


Figure 8. RMSDSE(x) evaluation based on 200 simulated datasets. [$N_{XP_1}$ = 973, $N_{XP_2}$ = 296, $N_{YP_1}$ = 958, $N_{YP_2}$ = 294.]

Figure 9. RMSDSE(x) evaluation based on 200 simulated datasets. [$N_{XP_1}$ = 300, $N_{XP_2}$ = 300, $N_{YP_1}$ = 300, $N_{YP_2}$ = 300.]

Figure 10. RMSDSE(x) evaluation based on 200 simulated datasets. [$N_{XP_1}$ = 1,000, $N_{XP_2}$ = 1,000, $N_{YP_1}$ = 1,000, $N_{YP_2}$ = 1,000.]

Discussion


This paper demonstrates how equating functions’ sampling variabilities can be estimated 
and incorporated into evaluations of population invariance. The kernel method and its delta 
method standard errors were extended to compute the standard errors of differences between 
subpopulations’ equating functions and standard errors of the RMSD. These standard errors were 
demonstrated on actual test data and evaluated in terms of equating function variability from data 
that were simulated from the same conditions as the actual data. The delta method standard error 
estimates of equated score differences more closely approximated actual sampling variability 
than did the delta method standard error estimates of the RMSD. The delta method estimates for 
the equated score differences and the RMSD were closer to actual variability when the equating 
functions were computed based on large, rather than small, sample sizes, which was expected 
from the literature (Jarjoura & Kolen, 1985; Liou & Cheng, 1995; Liou, Cheng, & Johnson, 
1997). The delta estimates would not have been expected to work as well had they been 
evaluated with respect to other equating decisions, such as the selection of the appropriate 
loglinear model and/or bandwidth parameter. These additional decisions are not incorporated in 
the delta method estimates, but are certainly relevant to the decisions made in practice that add 
variability to all aspects of equating. 

Many extensions of this work are possible. Because the derivations given in this paper 
utilize the kernel method’s general framework, the computation of the standard errors for 
population invariance measures in all of the major equating designs is straightforward.
Population invariance evaluations in the single group, counterbalanced, and non-equivalent 
groups with anchor test designs are more complex than in the equivalent groups design because 
they are based on bivariate frequency tables of highly correlated tests that are usually sparse and 
require more complicated loglinear models and model search strategies. In addition to extending 
this work to other equating designs, wide ranges of sample size, degrees of true and false 
population invariance, and situations with more than two subpopulations could also be 
considered. These extensions would be informative for investigations of population invariance 
that commonly evaluate population invariance with respect to practical, rather than statistical, 
significance. 





References 


Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.),
Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council
on Education.

Angoff, W. H., & Cowell, W. R. (1986). An examination of the assumption that the equating of
parallel forms is population-independent. Journal of Educational Measurement, 23(4),
327-345.

Bishop, Y. M. M., Feinberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: 
Theory and practice. Cambridge, MA: MIT Press. 

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. 
New York: Springer-Verlag. 

Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the SAT 
and PSAT/NMSQT. In I. M. Lawrence, N. J. Dorans, M. D. Feigenbaum, M. Feryok, A. 
P. Schmitt, & N. K. Wright (Eds.), Technical issues related to the introduction of the new 
SAT and PSAT/NMSQT (ETS RM-94-10). Princeton, NJ: ETS. 

Dorans, N. J., & Holland, P. W. (2000). Population invariance and the equatability of tests: Basic 
theory and the linear case. Journal of Educational Measurement, 37, 281-306. 

Dorans, N. J., Holland, P. W., Thayer, D. T., & Tateneni, K. (2004). Invariance of score linking
across gender groups for three Advanced Placement Program examinations. In N. J.
Dorans (Ed.), Population invariance of score linking: Theory and applications to
Advanced Placement Program examinations (ETS RR-03-27, pp. 79-118). Princeton,
NJ: ETS.

Haberman, S. J. (1974). Log-linear models for frequency tables with ordered classifications. 
Biometrics, 30, 589-600. 

Harris, D. J., & Kolen, M. J. (1986). Effect of examinee group on equating relationships. Applied 
Psychological Measurement, 10(1), 35-43. 

Holland, P. W., King, B. F., & Thayer, D. T. (1989). The standard error of equating for the 
kernel method of equating score distributions (ETS Technical Report 89-83). Princeton, 
NJ: ETS. 

Holland, P. W., & Thayer, D. T. (2000). Univariate and bivariate loglinear models for discrete 
test score distributions. Journal of Educational and Behavioral Statistics, 25, 133-183. 


Holland, P. W., & Thayer, D. T. (1989). The kernel method of equating score distributions (ETS 
Technical Report 89-84). Princeton, NJ: ETS. 

Jarjoura, D., & Kolen, M. J. (1985). Standard errors of equipercentile equating for the common 
item nonequivalent populations design. Journal of Educational Statistics, 10, 143-160. 

Kendall, M., & Stuart, A. (1977). The advanced theory of statistics (4th ed., Vol. 1). New York: 
Macmillan. 

Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: 
Springer-Verlag. 

Liou, M., Cheng, P. E., & Johnson, E. G. (1997). Standard errors of the kernel equating methods 
under the common-item design. Applied Psychological Measurement, 21(4), 349-369. 

Liou, M., & Cheng, P. E. (1995). Asymptotic standard error of equipercentile equating. Journal 
of Educational and Behavioral Statistics, 20, 259-286. 

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, 
NJ: Erlbaum. 

Peterson, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. 
Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan. 

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in 
C: The art of scientific computing (2nd ed.). New York: Cambridge University Press. 

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve 
estimation. Psychometrika, 56(4), 611-630. 

Segall, D. O. (1997). Equating the CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. 
McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 181-198). 
Washington, DC: American Psychological Association. 

Williams, V. S. L., Rosa, K. R., McLeod, L. D., Thissen, D., & Sanford, E. E. (1998). Projecting 
to the NAEP scale: Results from the North Carolina end-of-grade testing program. 
Journal of Educational Measurement, 35(4), 277-296. 

Yang, W. L. (2004). Sensitivity of linkings between AP multiple-choice scores and composite 
scores to geographical region: An illustration of checking for population invariance. 
Journal of Educational Measurement, 41(1), 33-41. 

Yin, P., Brennan, R. L., & Kolen, M. J. (2004, July). Concordance between ACT and ITED 
scores from different populations. Applied Psychological Measurement, 28(4), 274-289. 



Appendix A 

The Standard Error of Equating Difference for Independent Subgroups’ Equating Functions 
(Equivalent Groups Design) 


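Let the two subgroups be $P_1$ and $P_2$. 

Let the row vectors of partial derivatives of the equating function with respect to the score 
probabilities of X and Y for each subpopulation be 

$$J_{e_{YP_1}} = \left( \frac{\partial e_{YP_1}}{\partial r_{P_1}}, \frac{\partial e_{YP_1}}{\partial s_{P_1}} \right)$$ 

and 

$$J_{e_{YP_2}} = \left( \frac{\partial e_{YP_2}}{\partial r_{P_2}}, \frac{\partial e_{YP_2}}{\partial s_{P_2}} \right).$$ 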
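Because the two subgroups are independent, their equated scores are uncorrelated, so 

$$\mathrm{Var}(e_{YP_1} - e_{YP_2}) = \mathrm{Var}(e_{YP_1}) + \mathrm{Var}(e_{YP_2}).$$ 

Therefore, the SEED between independent subpopulations’ equating functions is 

$$\mathrm{SEED} = \sqrt{\mathrm{Var}(e_{YP_1}) + \mathrm{Var}(e_{YP_2})}.$$ 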
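
The SEED result for independent subgroups can be turned into a small calculation. The following Python sketch is illustrative only: the function names `seed_independent` and `difference_ratio` and the numeric inputs are hypothetical, and the two variances are assumed to have been obtained already from the kernel method's standard error formulas at a single score point. 

```python
import math


def seed_independent(var_p1, var_p2):
    """SEED at one score point for two independent subgroups.

    var_p1, var_p2: sampling variances of the subgroups' equated scores
    (assumed to be supplied by the caller).
    """
    if var_p1 < 0 or var_p2 < 0:
        raise ValueError("variances must be non-negative")
    # Independence means there is no covariance term to subtract.
    return math.sqrt(var_p1 + var_p2)


def difference_ratio(e_p1, e_p2, var_p1, var_p2):
    """Equated score difference divided by its SEED."""
    return (e_p1 - e_p2) / seed_independent(var_p1, var_p2)


# Hypothetical inputs: equated scores 26.0 and 25.0 with variances 0.09 and 0.16.
seed = seed_independent(0.09, 0.16)
ratio = difference_ratio(26.0, 25.0, 0.09, 0.16)
print(seed, ratio)
```

One way to read the ratio, offered as an assumption rather than as the paper's rule, is to compare its absolute value with a normal-theory cutoff such as 2 when judging whether a difference exceeds sampling variability. 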

Appendix B 

Derivatives of the RMSD With Respect to the Score Probabilities 

The Derivative of RMSD With Respect to $r_{jP_g}$ 


$$\mathrm{RMSD}(x_j) = \frac{\sqrt{\sum_g w_g \left[ e_{YP_g}(x_j) - e_{YP}(x_j) \right]^2}}{\sigma_{YP}}$$ 

By the chain rule, the derivative of $\mathrm{RMSD}(x_j)$ with respect to $r_{jP_g}$ is built from the 
derivatives of the subpopulation and total-population equating functions. Then 
$\partial e_{YP_g}(x_j) / \partial r_{jP_g}$ and $\partial e_{YP}(x_j) / \partial r_{jP}$ 
can be computed from equations given in Holland, King, and Thayer (1989) and in von Davier et al. 
(2004). 
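
For readers who want the chain-rule step written out, the following is a sketch under two assumptions that are not stated in this appendix: that RMSD takes the Dorans and Holland (2000) form with subpopulation weights $w_h$, and that the total-population score probabilities are the weighted mixture $r_{jP} = \sum_h w_h r_{jP_h}$. The symbol $D(x)$ is introduced here only for convenience and is assumed to be positive. 

```latex
\begin{align*}
\mathrm{RMSD}(x) &= \frac{\sqrt{D(x)}}{\sigma_{YP}}, \qquad
D(x) = \sum_h w_h \left[ e_{YP_h}(x) - e_{YP}(x) \right]^2, \\
\frac{\partial\, \mathrm{RMSD}(x)}{\partial r_{jP_g}}
  &= \frac{1}{2\,\sigma_{YP}\sqrt{D(x)}}\,
     \frac{\partial D(x)}{\partial r_{jP_g}}, \\
\frac{\partial D(x)}{\partial r_{jP_g}}
  &= 2 w_g \left[ e_{YP_g}(x) - e_{YP}(x) \right]
     \frac{\partial e_{YP_g}(x)}{\partial r_{jP_g}}
   - 2 w_g \frac{\partial e_{YP}(x)}{\partial r_{jP}}
     \sum_h w_h \left[ e_{YP_h}(x) - e_{YP}(x) \right].
\end{align*}
```

The second line uses the fact that $\sigma_{YP}$ depends only on the Y score probabilities, so it is constant with respect to $r_{jP_g}$. The third line uses the fact that only $e_{YP_g}$ and $e_{YP}$ depend on $r_{jP_g}$, with $\partial r_{jP} / \partial r_{jP_g} = w_g$ under the mixture assumption. 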




The Derivative of $\sigma_{YP}$ With Respect to $s_{kP_g}$ 

The derivative of RMSD with respect to $s_{kP_g}$ first requires the differentiation of $\sigma_{YP}$ 
with respect to $s_{kP_g}$ and then an application of the quotient rule. 
The derivative of $\sigma_{YP}$ with respect to $s_{kP}$ is given in the appendix in Holland et al. (1989). 

To obtain the derivative with respect to $s_{kP_g}$, we apply the chain rule and multiply by the 
derivative of $s_{kP}$ with respect to $s_{kP_g}$. 

The remaining derivatives of the equating functions with respect to the Y score probabilities 
can be computed from equations given in Holland et al. (1989) and in von Davier et al. (2004). 
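
The chain-rule step for $\sigma_{YP}$ can be checked numerically. The Python sketch below rests on assumptions that are not taken from this appendix: the total-population probabilities are the mixture $s_{kP} = \sum_g w_g s_{kP_g}$, the probabilities are treated as unconstrained parameters, and $\sigma_{YP}^2 = \sum_k y_k^2 s_{kP} - (\sum_k y_k s_{kP})^2$. The function names and the toy scores, weights, and probabilities are hypothetical, and the analytic form used here may be written differently in Holland et al. (1989). 

```python
import math


def sd_total(y, s_groups, w):
    """Return (sigma_YP, mu_YP) for the mixture s_kP = sum_g w_g * s_kPg."""
    s_p = [sum(wg * sg[k] for wg, sg in zip(w, s_groups)) for k in range(len(y))]
    mean = sum(yk * sk for yk, sk in zip(y, s_p))
    second = sum(yk * yk * sk for yk, sk in zip(y, s_p))
    return math.sqrt(second - mean * mean), mean


def dsd_ds_kpg(y, s_groups, w, k, g):
    """Analytic derivative: w_g * (y_k^2 - 2 * mu_YP * y_k) / (2 * sigma_YP)."""
    sigma, mean = sd_total(y, s_groups, w)
    # d sigma / d s_kP, then multiply by d s_kP / d s_kPg = w_g.
    return w[g] * (y[k] ** 2 - 2.0 * mean * y[k]) / (2.0 * sigma)


def finite_difference(y, s_groups, w, k, g, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    up = [list(sg) for sg in s_groups]
    down = [list(sg) for sg in s_groups]
    up[g][k] += h
    down[g][k] -= h
    return (sd_total(y, up, w)[0] - sd_total(y, down, w)[0]) / (2.0 * h)


# Toy example: four score points, two subgroups with weights 0.6 and 0.4.
y = [0.0, 1.0, 2.0, 3.0]
s_groups = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
w = [0.6, 0.4]
max_gap = max(
    abs(dsd_ds_kpg(y, s_groups, w, k, g) - finite_difference(y, s_groups, w, k, g))
    for g in range(len(w))
    for k in range(len(y))
)
print(max_gap)
```

A small `max_gap` indicates that the analytic expression and the numerical derivative agree for this toy input; it does not validate the full RMSD derivative, which also needs the equating-function derivatives and the quotient rule. 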





